Racial discrimination continues to be pervasive in cultures throughout the world. Researchers examined the level of racial discrimination in the United States labor market by randomly assigning black-sounding or white-sounding names to otherwise identical résumés and observing the impact on requests for interviews from employers.
In the dataset provided, each row represents a résumé. The 'race' column has two values, 'b' and 'w', indicating a black-sounding or a white-sounding name. The 'call' column has two values, 1 and 0, indicating whether or not the résumé received a call from an employer.
Note that the 'b' and 'w' values in 'race' are assigned randomly to the résumés.
You will perform a statistical analysis to establish whether race has a significant impact on the rate of callbacks for resumes.
Answer the following questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown.
In [1]:
%matplotlib inline
from __future__ import division
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15.0,5.0)
import pandas as pd
import numpy as np
from scipy import stats
In [2]:
data = pd.io.stata.read_stata('data/us_job_market_discrimination.dta')
print "Total count: ",len(data)
print "race == 'b': ",len(data[data.race=='b'])
print "race == 'w': ",len(data[data.race=='w'])
In [3]:
data.head()
Out[3]:
In [4]:
# number of callbacks and proportion of callbacks
print "Callback count for black-sounding names: ",sum(data[data.race=='b'].call)
print "Callback proportion for black-sounding names: ",sum(data[data.race=='b'].call)/len(data[data.race=='b'])
print "Callback count for white-sounding names: ",sum(data[data.race=='w'].call)
print "Callback proportion for white-sounding names: ",sum(data[data.race=='w'].call)/len(data[data.race=='w'])
The outcome variable here is binary, so it can be treated in several ways. First, it might be possible to apply the normal approximation to the binomial distribution; in that case the number of call-backs is approximately distributed as $\mathcal{N}(np,\,np(1-p))$.
There are a number of guidelines for judging whether this is a suitable approximation (see Wikipedia for a list of such conditions; a common rule of thumb is that $np$ and $n(1-p)$ should both be at least around 5 to 10). These conditions can be roughly summed up as requiring not too small a sample and an estimated proportion far enough from 0 and 1 that the distribution isn't overly skewed. If the normal approximation is reasonable, a z-test can be used, with the following standard error calculation:
$$SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$
where
$$\hat{p}=\frac{n_1 p_1+n_2 p_2}{n_1+n_2}$$
giving
$$z = \frac{p_1-p_2}{SE}$$
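Before relying on the approximation, the rule-of-thumb quantities $np$ and $n(1-p)$ can be checked quickly for each group. This is only a rough sketch (the same counts are recomputed in the next cell):
In [ ]:
# rough check of the np and n(1-p) rule of thumb for each group
for label in ['b','w']:
    n = len(data[data.race==label])
    p = sum(data[data.race==label].call)/n
    print label, ": n*p =", round(n*p,1), ", n*(1-p) =", round(n*(1-p),1)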
In [5]:
# call-back counts and sample sizes for each group
xb = sum(data[data.race=='b'].call)
nb = len(data[data.race=='b'])
xw = sum(data[data.race=='w'].call)
nw = len(data[data.race=='w'])
# pooled proportion, standard error, and z-score from the formulas above
pHat = (nb*(xb/nb) + nw*(xw/nw))/(nb+nw)
se = np.sqrt(pHat*(1-pHat)*(1/nb + 1/nw))
z = (xb/nb - xw/nw)/se
print "z-score:",round(z,3),"p =", round(stats.norm.sf(abs(z))*2,6)
So, the difference in probability of a call-back is statistically significant here.
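As a cross-check on the hand-rolled z-test above, the same two-sample proportion test is available in statsmodels (a sketch, assuming statsmodels is installed; it uses the same pooled standard error and should give essentially the same z and p):
In [ ]:
from statsmodels.stats.proportion import proportions_ztest
# call-back counts and sample sizes in the same group order as above (b, then w)
zStat, pVal = proportions_ztest([xb, xw], [nb, nw])
print "z-score:", round(zStat,3), "p =", round(pVal,6)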
Plotting the binomial distribution of call-back counts for black-sounding names (using the estimated proportion), it looks fairly symmetrical and well-behaved, so the normal approximation is probably reasonable here.
In [6]:
pb = xb/nb
# binomial pmf for the number of call-backs among black-sounding names,
# plotted over a range of counts around the observed value
x = np.arange(110,210)
plt.vlines(x,0,stats.binom.pmf(x,nb,pb))
Out[6]:
Because the normal distribution is only an approximation, the assumptions don't always work out for a particular data set. There are several other methods for calculating confidence intervals around an estimated proportion. For example, for a $100(1-\alpha)\%$ interval, the Jeffreys interval is defined by the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of a Beta$(x+\frac{1}{2},\,n-x+\frac{1}{2})$ distribution. Using scipy:
In [7]:
# 95% Jeffreys intervals for each group
intervalB = (stats.beta.ppf(0.025,xb+0.5,nb-xb+0.5),stats.beta.ppf(0.975,xb+0.5,nb-xb+0.5))
intervalW = (stats.beta.ppf(0.025,xw+0.5,nw-xw+0.5),stats.beta.ppf(0.975,xw+0.5,nw-xw+0.5))
print "Interval for black-sounding names: ",map(lambda x: round(x,3),intervalB)
print "Interval for white-sounding names: ",map(lambda x: round(x,3),intervalW)
The complete lack of overlap between the two intervals implies a significant difference with $p < 0.05$ (Cumming & Finch, 2005). Given that this particular interval can also be interpreted as a Bayesian credible interval, this is a fairly comfortable conclusion.
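The same Jeffreys intervals can also be obtained directly from statsmodels, which makes a convenient cross-check against the scipy computation above (a sketch, assuming statsmodels is installed):
In [ ]:
from statsmodels.stats.proportion import proportion_confint
# 95% Jeffreys intervals (Beta(x+0.5, n-x+0.5) quantiles) for each group
print map(lambda v: round(v,3), proportion_confint(xb, nb, alpha=0.05, method='jeffreys'))
print map(lambda v: round(v,3), proportion_confint(xw, nw, alpha=0.05, method='jeffreys'))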
In [8]:
import pystan
In [9]:
modelCode = '''
data {
  int<lower=0> N;                  // number of resumes
  int<lower=1,upper=2> G[N];       // group: 1 = black-sounding, 2 = white-sounding
  int<lower=0,upper=1> y[N];       // call-back indicator
}
parameters {
  real<lower=0,upper=1> theta[2];  // call-back probability per group
}
model {
  // beta(0.5,0.5) prior
  theta ~ beta(0.5,0.5);
  // bernoulli likelihood
  // This could be modified to use a binomial with successes and counts instead
  for (i in 1:N)
    y[i] ~ bernoulli(theta[G[i]]);
}
generated quantities {
  real diff;
  // difference in proportions:
  diff <- theta[1]-theta[2];
}
'''
model = pystan.StanModel(model_code=modelCode)
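As noted in the comment inside the model block, the per-résumé Bernoulli likelihood could be collapsed into a binomial on the group-level counts, which gives the same posterior but samples faster since the likelihood only involves two observations. A sketch of that variant (not compiled or run here; `modelCodeBinomial` is just an illustrative name):
In [ ]:
modelCodeBinomial = '''
data {
  int<lower=0> x[2];               // call-back counts per group
  int<lower=0> n[2];               // number of resumes per group
}
parameters {
  real<lower=0,upper=1> theta[2];  // call-back probability per group
}
model {
  theta ~ beta(0.5,0.5);           // same beta(0.5,0.5) prior as above
  x ~ binomial(n, theta);          // binomial likelihood on the counts
}
generated quantities {
  real diff;
  diff <- theta[1]-theta[2];       // difference in proportions
}
'''
# e.g.:
# binomialModel = pystan.StanModel(model_code=modelCodeBinomial)
# binomialFit = binomialModel.sampling(data=dict(x=[int(xb),int(xw)], n=[nb,nw]))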
In [10]:
# G: group index, 1 for black-sounding names, 2 for white-sounding names
dataDict = dict(N=len(data),G=np.where(data.race=='b',1,2),y=map(int,data.call))
fit = model.sampling(data=dataDict)
In [11]:
print fit
In [12]:
samples = fit.extract(permuted=True)
MCMCIntervalB = np.percentile(samples['theta'].transpose()[0],[2.5,97.5])
MCMCIntervalW = np.percentile(samples['theta'].transpose()[1],[2.5,97.5])
fit.plot().show()
Estimating rough 95% credible intervals:
In [13]:
print map(lambda x: round(x,3),MCMCIntervalB)
print map(lambda x: round(x,3),MCMCIntervalW)
So, this method gives a result that fits quite nicely with previous results, while allowing more flexible specification of priors.
Interval for sampled differences in proportions:
In [14]:
print map(lambda x: round(x,3),np.percentile(samples['diff'],[2.5,97.5]))
And this interval does not include 0, so we're left fairly confident that black-sounding names receive fewer call-backs, although the estimated difference in proportions is fairly small ('significant' in the technical sense isn't really the right word for this part).
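A complementary summary is the posterior probability that the proportion for black-sounding names is below that for white-sounding names, which is just the fraction of posterior draws with a negative difference (a short sketch using the `samples` extracted above):
In [ ]:
# fraction of posterior draws with theta[black-sounding] < theta[white-sounding]
print round(np.mean(samples['diff'] < 0), 4)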
A next step here would be to check whether other factors influence the proportion of call-backs. This can be done using logistic regression, although there is a limit to the complexity of the model that can be fit: the proportion of call-backs is quite small, which can lead to small cell counts and unstable estimates (one rule of thumb is that roughly n > 30 per cell is reasonably safe).
In [15]:
data.columns
Out[15]:
In [16]:
# The data is balanced by design, and this mostly isn't a problem for relatively simple models.
# For example:
pd.crosstab(data.computerskills,data.race)
Out[16]:
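The count that matters more for stability, though, is the number of call-backs within each cell, since call-backs are the rare outcome. Cross-tabulating against `call` shows this directly (a quick sketch):
In [ ]:
# call-backs (call == 1.0) broken down by race and computer skills
pd.crosstab([data.race, data.computerskills], data.call)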
In [17]:
import statsmodels.formula.api as smf
Checking to see if computer skills have a significant effect on call-backs:
In [18]:
glm = smf.Logit.from_formula(formula="call~race+computerskills",data=data).fit()
glm.summary()
Out[18]:
The effect of computer skills might be described as marginal, but it's probably best not to over-interpret it. Maybe the combination of race and computer skills makes a difference? Apparently not in this data: adding the interaction doesn't even improve the model log-likelihood or other measures of model fit:
In [19]:
glm2 = smf.Logit.from_formula(formula="call~race*computerskills",data=data).fit()
glm2.summary()
Out[19]:
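To make the 'no improvement in fit' point concrete, a likelihood-ratio test between the two nested models can be read off the fitted log-likelihoods (a sketch using the `glm` and `glm2` fits above; the interaction adds one parameter):
In [ ]:
# likelihood-ratio test: interaction model vs. main-effects model (df = 1)
lrStat = 2*(glm2.llf - glm.llf)
print "LR statistic:", round(lrStat,4), "p =", round(stats.chi2.sf(lrStat,1),4)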
But there's still rather a lot left to explore in this data.